Grades and Feedback

  • If you lost points for the dashboard, show Ethan your .Rmd document to get points back
  • Treat the projects like a formal report
    • Proofread! Make sure plots render.
    • Make your code/plots easy to read! Include comments and labels.
  • Follow the rubric closely when grading your exercises; email me if you have any uncertainty.

Marketing Analytics Process

Inferential Modeling Workflow

Simulating Data Redux

Remember that once we have a model of the data generating process we can pretend that our model is the data generating process and simulate data.

# Load packages.
library(tidyverse)
library(tidymodels)

# What's this again?!
set.seed(42)

# Set variable and parameter values.
nobs <- 300
intercept <- 30
slope <- 9

# Simulate data.
sim_data_01 <- tibble(
  discount = round(runif(nobs, min = 0, max = 20)),
  sales = intercept + slope * discount + rnorm(nobs, mean = 0, sd = 5)
)

sim_data_01
## # A tibble: 300 × 2
##    discount sales
##       <dbl> <dbl>
##  1       18 192. 
##  2       19 193. 
##  3        6  89.8
##  4       17 182. 
##  5       13 145. 
##  6       10 114. 
##  7       15 165. 
##  8        3  53.0
##  9       13 144. 
## 10       14 162. 
## # ℹ 290 more rows

sim_data_01 |> 
  ggplot(aes(x = discount, y = sales)) +
  geom_jitter(size = 2, alpha = 0.5) +
  geom_smooth(method = "lm")

We can use the binomial distribution to simulate binary data. A binary variable has two levels, with the level coded as zero known as the baseline level or reference level.

# Simulate some more data.
sim_data_02 <- tibble(
  coupon = rbinom(nobs, size = 1, prob = 0.7),
  sales = intercept + slope * coupon + rnorm(nobs, mean = 0, sd = 5)
)

sim_data_02
## # A tibble: 300 × 2
##    coupon sales
##     <int> <dbl>
##  1      1  41.1
##  2      1  43.9
##  3      1  43.2
##  4      0  26.7
##  5      1  46.8
##  6      1  30.9
##  7      1  43.3
##  8      1  36.4
##  9      1  29.4
## 10      1  29.7
## # ℹ 290 more rows

sim_data_02 |> 
  ggplot(aes(x = coupon, y = sales)) +
  geom_jitter(size = 2, alpha = 0.5) +
  geom_smooth(method = "lm")

Fit the Model

Remember that when we fit a linear model we are finding the line of best fit and getting parameter estimates.

# Fit the first model.
fit_01 <- linear_reg() |> 
  set_engine("lm") |> 
  fit(sales ~ discount, data = sim_data_01)

# Fit the second model.
fit_02 <- linear_reg() |> 
  set_engine("lm") |> 
  fit(sales ~ coupon, data = sim_data_02)

Parameter Estimates

Remember that our goal is to use the model to estimate the unobserved parameters from the data.

# Evaluate model fit.
fit_01 |> 
  tidy()
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    29.8     0.537       55.5 2.92e-159
## 2 discount        9.01    0.0466     193.  2.46e-315

How Do We Interpret Our Models?

\[sales = 29.80 + 9.01 \times discount\]

If \(x\) is continuous:

  • The intercept parameter \(\beta_0\) represents the expected value of \(y\) when \(x\) is zero.
  • The associated slope parameter \(\beta_1\) represents the expected amount by which \(y\) will change given a one unit increase in \(x\).
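As a quick check on this interpretation, here is a base-R sketch (using `lm()` rather than the `linear_reg()` fit above, and re-simulating data with the same settings as `sim_data_01`): predictions one unit of discount apart differ by exactly the estimated slope.

```r
# Re-simulate data with the same settings as sim_data_01 and fit with lm().
set.seed(42)
n <- 300
discount <- round(runif(n, min = 0, max = 20))
sales <- 30 + 9 * discount + rnorm(n, mean = 0, sd = 5)
fit <- lm(sales ~ discount)

# Predicted sales at discount = 10 and discount = 11 differ by exactly
# the estimated slope.
preds <- predict(fit, newdata = data.frame(discount = c(10, 11)))
diff(preds)
coef(fit)["discount"]
```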

What’s different again about sim_data_02?

# Evaluate model fit.
fit_02 |> 
  tidy()
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    29.6      0.518      57.2 1.00e-162
## 2 coupon          9.07     0.617      14.7 3.13e- 37

\[sales = 29.60 + 9.07 \times coupon\]

If \(x\) is discrete:

  • The intercept parameter \(\beta_0\) represents the expected value of \(y\) when \(x\) is equal to the baseline level.
  • The associated slope parameter \(\beta_1\) represents the expected amount by which \(y\) will change relative to the baseline level of \(x\).
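With a single binary explanatory variable, these interpretations have a nice concrete form: the intercept is the mean outcome at the baseline level, and the slope is the difference in group means. A base-R sketch (re-simulating data with the same settings as `sim_data_02`, fit with `lm()`):

```r
# Re-simulate data with the same settings as sim_data_02 and fit with lm().
set.seed(42)
n <- 300
coupon <- rbinom(n, size = 1, prob = 0.7)
sales <- 30 + 9 * coupon + rnorm(n, mean = 0, sd = 5)
fit <- lm(sales ~ coupon)

# The intercept equals the baseline (coupon = 0) group mean; the slope
# equals the coupon group mean minus the baseline group mean.
mean(sales[coupon == 0])
mean(sales[coupon == 1]) - mean(sales[coupon == 0])
coef(fit)
```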

Intervals and Significance

So far the parameter estimates have been point estimates: a single number that represents our best guess.

But this is a statistical model, so there is always uncertainty (i.e., error). We can also produce an interval estimate of a parameter: a range of values that represents our best guess.

# Include confidence intervals.
tidy(fit_01, conf.int = TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    29.8     0.537       55.5 2.92e-159    28.7      30.9 
## 2 discount        9.01    0.0466     193.  2.46e-315     8.92      9.10

These interval estimates are called confidence intervals. Given our model and the data, they are our best guess of the unobserved parameters.

If the confidence interval doesn’t include zero, we can say that the parameter estimate is statistically significant (a.k.a., significant or significantly different from zero).

We reach the same conclusion using a confidence interval as when using a p-value!

# Include confidence intervals.
tidy(fit_02, conf.int = TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    29.6      0.518      57.2 1.00e-162    28.6       30.7
## 2 coupon          9.07     0.617      14.7 3.13e- 37     7.86      10.3

Let’s Go on a Small Tangent…

We’ll examine the properties of point estimates and confidence intervals using simulation.
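One way to sketch that tangent (using base R's `lm()` and `confint()` for brevity, with arbitrary settings for the number of replications and sample size): simulate many datasets from the same data generating process, fit the model to each, and count how often the 95% confidence interval covers the true slope. Coverage should land near 95%.

```r
# Simulate 500 datasets from the same data generating process, fit a
# model to each, and record whether the 95% confidence interval for
# the slope covers the true value of 9.
set.seed(42)
true_slope <- 9
covered <- replicate(500, {
  discount <- round(runif(100, min = 0, max = 20))
  sales <- 30 + true_slope * discount + rnorm(100, mean = 0, sd = 5)
  ci <- confint(lm(sales ~ discount))["discount", ]
  ci[1] <= true_slope && true_slope <= ci[2]
})

# The proportion of intervals that cover the true slope should be
# close to 0.95.
mean(covered)
```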

Model Comparison

Eventually we’ll want to compare how well different models fit the same data. To do that efficiently we need a single number that describes the overall model fit.

# Look at the r.squared.
glance(fit_01)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.992         0.992  4.73    37436. 2.46e-315     1  -891. 1787. 1798.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The \(R^2\) is the proportion of variation in \(y\) that can be explained by the model (i.e., the explanatory variable(s)). Closer to 1 is better! This is easy to misuse, so it will be our measure of overall model fit only temporarily.
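To see where that number comes from, here is a base-R sketch computing \(R^2\) by hand as \(1 - SS_{res} / SS_{tot}\), re-simulating data with the same settings as `sim_data_01`:

```r
# Re-simulate data with the same settings as sim_data_01 and fit with lm().
set.seed(42)
discount <- round(runif(300, min = 0, max = 20))
sales <- 30 + 9 * discount + rnorm(300, mean = 0, sd = 5)
fit <- lm(sales ~ discount)

# R-squared by hand: one minus the residual sum of squares over the
# total sum of squares. Matches summary(fit)$r.squared.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((sales - mean(sales))^2)
1 - ss_res / ss_tot
summary(fit)$r.squared
```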

Prediction

While the parameter estimates are often the object of interest for an inferential model, we of course can predict the outcome using the fitted model:

\[sales = 29.60 + 9.07 \times coupon\]

To predict the outcome we need new data representing the counterfactuals we’d like to predict, which we feed into the fitted model.

# Column names need to match the fitted model.
scenarios <- tibble(coupon = c(0, 1))

We can predict using the fitted model and the new data.

# Predict and bind on the new data.
predict(fit_02, new_data = scenarios) |> 
  bind_cols(scenarios)
## # A tibble: 2 × 2
##   .pred coupon
##   <dbl>  <dbl>
## 1  29.6      0
## 2  38.7      1

Remember that we have more than point estimates. We can compute interval estimates for predictions as well, called prediction intervals.

# Predict and bind on prediction intervals.
bind_cols(
  predict(fit_02, new_data = scenarios),
  predict(fit_02, new_data = scenarios, type = "pred_int"),
  scenarios
)
## # A tibble: 2 × 4
##   .pred .pred_lower .pred_upper coupon
##   <dbl>       <dbl>       <dbl>  <dbl>
## 1  29.6        20.0        39.3      0
## 2  38.7        29.1        48.3      1

Recover Parameters

When we work with real data, it can be hard to tell the difference between doing something wrong with the model and doing something wrong with the code.

Another reason to simulate data is to prove that our code is working by recovering the parameter values we used to simulate the data.

tidy(fit_01, conf.int = TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    29.8     0.537       55.5 2.92e-159    28.7      30.9 
## 2 discount        9.01    0.0466     193.  2.46e-315     8.92      9.10
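We can make the recovery check programmatic rather than eyeballing the output: do the confidence intervals contain the parameter values we used to simulate the data? A base-R sketch, re-simulating with the same settings:

```r
# Re-simulate data with the same settings as sim_data_01 and check
# whether the confidence intervals contain the true parameter values
# (intercept = 30, slope = 9).
set.seed(42)
discount <- round(runif(300, min = 0, max = 20))
sales <- 30 + 9 * discount + rnorm(300, mean = 0, sd = 5)
ci <- confint(lm(sales ~ discount))

ci["(Intercept)", 1] <= 30 && 30 <= ci["(Intercept)", 2]
ci["discount", 1] <= 9 && 9 <= ci["discount", 2]
```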

Wrapping Up

Summary

  • Discussed interpreting parameter estimates and statistical significance.
  • Considered model comparison.
  • Walked through prediction.
  • Demonstrated parameter recovery.

Next Time

  • Adding explanatory variables.

Supplementary Material

  • Tidy Modeling with R, Sections 6.2-6.3

Artwork by @allison_horst

Exercise 10

  1. Simulate a single dataset with three variables: discount using a uniform distribution, coupon as a binary variable using the binomial distribution, and sales as a function of only discount, using a linear model.
  2. Fit two models using {tidymodels}: sales ~ discount and sales ~ discount + coupon.
  3. Evaluate model fit for both models. Compare parameter estimates and overall model fit. Which model is able to recover parameter values?
  4. Using the best-fitting model, predict some possible counterfactuals.
  5. Render the Quarto document into Word and upload to Canvas.